class: center, middle, inverse, title-slide # Introduction to Survey Data Cleaning Using Tidyverse in R ## Data Wrangling - Part 2 ### Johannes Breuer
Stefan Jünger ### 2021-07-22 --- layout: true <div class="my-footer"> <div style="float: left;"><span>Johannes Breuer, Stefan Jünger</span></div> <div style="float: right;"><span>ESRA 2021, 2021-07-22</span></div> <div style="text-align: center;"><span>Data Wrangling - Part 2</span></div> </div> --- ## Data wrangling continued 🤠 While in the last session we focused on changing the structure of our data by **selecting**, **renaming**, and **relocating** columns and **filtering** and **arranging** rows, in this part we will focus on altering the content of data sets by *adding* and *changing* variables and variable values. More specifically, we will deal with... - creating and computing new variables (in various ways) - recoding the values of a variable - dealing with missing values --- ## `dplyr::mutate()` <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\dplyr_mutate.png" width="60%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> --- ## Creating a new variable A simple example for creating a new variable is adding a numeric ID variable based on the row number in the data set. .small[ ```r gpc <- gpc %>% * mutate(id = row_number()) %>% relocate(id, .before = everything()) # move the id column before all other columns gpc %>% select(1:5) %>% glimpse ``` ``` ## Rows: 3,765 ## Columns: 5 ## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,~ ## $ cohort <dbl> 2, 1, 1, 2, 3, 2, 1, 2, 2, 2, 1, 3, 3, 1, 2, 1, 1, 1, 3, 2, 2, 2, 3, 1, 1, 1, 3, 3, 3, 1, 2, 2, 2, 1, 1, 1, 1,~ ## $ sex <dbl> 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2,~ ## $ age_cat <dbl> 10, 2, 8, 1, 7, 7, 7, 7, 8, 6, 9, 7, 2, 2, 7, 7, 7, 4, 10, 7, 4, 2, 4, 9, 9, 8, 10, 2, 7, 10, 7, 7, 7, 3, 7, 1~ ## $ education_cat <dbl> 3, 3, 1, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2,~ ``` ] *Note*: The function `rowid_to_column()` from the `tibble` package is a useful alternative for this particular task which also automatically includes the id variable as the first column. --- ## Recoding values Very often we want to recode values in a variable (e.g., if we have reverse-scored items as part of a scale). Say, for example, you want to recode the item from the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* that measures trust in scientists with regard to dealing with the Coronavirus so that it represents distrust instead. .small[ ```r gpc <- gpc %>% mutate(hzcy052aR = recode(hzcy052a, `5` = 1, # `old value` = new value `4` = 2, `2` = 4, `1` = 5)) gpc %>% select(hzcy052a, hzcy052aR) %>% head() ``` ``` ## # A tibble: 6 x 2 ## hzcy052a hzcy052aR ## <dbl> <dbl> ## 1 5 1 ## 2 5 1 ## 3 4 2 ## 4 NA NA ## 5 5 1 ## 6 98 98 ``` ] --- ## Missing values Most of the real datasets we work with have missing data. As the data can be missing for various reasons, we often use codes (and labels) to distinguish between different types of missing data. If you look at the the [codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378) of the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* or the [*GESIS Panel* Cheatsheet](https://www.gesis.org/fileadmin/upload/GESIS_Panel/Cheatsheet/gesis_panel_cheatsheet.pdf), you will see that there are quite a few types of and codes for missing data. Some types of missing values are the same across variables, while some variables also have additional types of missing data (and, hence, additional codes for missings). Notably, however, in the process of creating the synthetic data we use in this course, all values < 0 have been changed to `NA`.<sup>1</sup> .footnote[ [1] `NA` is a reserved term in `R`, meaning that you cannot use it as a name for anything else (this is also the case for `TRUE` and `FALSE`) ] --- ## Wrangling missing values When we prepare our data for analysis there are generally two things we might want/have to do with regard to missing values: - define specific values as missings (i.e., set them to `NA`) - recode `NA` values into something else (typically to distinguish between different types of missing values) --- ## Recode values as `NA` If you want to recode specific values as `NA`, you can use the is the `dplyr` function `na_if()`. You can do this for an entire dataframe...<sup>1</sup> ```r gpc <- gpc %>% na_if(98) # here we just provide the value to be recoded as NA for all vars ``` .footnote[ [1] Remember that all values < 0 have been set as `NA` in this synthetic dataset (unlike in the original data), but there are some values that contain the value 98 representing the response option *I don't know*. ] --- ## Recode values as `NA` ... for individual variables, in which case you need to combine `na_if()` with `mutate()`,... .small[ .pull-left[ *Before* ```r gpc %>% pull(hzcy044a) %>% # pull extracts a vector from a df table(useNA = "always") ``` ``` ## . ## 1 2 3 4 5 98 <NA> ## 33 178 345 1223 1287 89 610 ``` ] .pull-right[ *After* ```r gpc <- gpc %>% * mutate(hzcy044a = na_if(hzcy044a, 98)) gpc %>% pull(hzcy044a) %>% table(useNA = "always") ``` ``` ## . ## 1 2 3 4 5 <NA> ## 33 178 345 1223 1287 699 ``` ] ] --- ## Recode values as `NA` ... or for a range of variables. In this case you need to also add the helper function `across()` to the mix, which allows you to apply the same transformation to multiple columns. .small[ ```r gpc <- gpc %>% * mutate(across(hzcy044a:hzcy052a, ~na_if(.x, 98))) gpc %>% select(hzcy044a:hzcy052a) %>% summary() ``` ``` ## hzcy044a hzcy045a hzcy046a hzcy047a hzcy048a hzcy049a hzcy050a hzcy051a ## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 ## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:4.00 ## Median :4.000 Median :4.000 Median :4.000 Median :5.000 Median :4.000 Median :4.000 Median :4.000 Median :4.00 ## Mean :4.159 Mean :3.829 Mean :3.435 Mean :4.435 Mean :3.622 Mean :3.518 Mean :3.816 Mean :4.01 ## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.00 ## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00 ## NA's :699 NA's :866 NA's :748 NA's :673 NA's :622 NA's :622 NA's :626 NA's :655 ## hzcy052a ## Min. :1.000 ## 1st Qu.:4.000 ## Median :4.000 ## Mean :4.239 ## 3rd Qu.:5.000 ## Max. :5.000 ## NA's :661 ``` ] --- ## Alternatives for recoding values as `NA` If you want recode a range or selection of specific values as `NA`, there are two neat options from non-tidyverse packages: 1. `set_na()` from the [`sjlabelled` package](https://strengejacke.github.io/sjlabelled/index.html) which is generally very useful for working with labelled data (e.g., from *SPSS* or *Stata*) 2. `replace_with_na()` and its scoped variants, such as `replace_with_na_all()`, from the [`naniar` package](http://naniar.njtierney.com/index.html) 🦁 which is a great package for handling missing data in `R` (and works very well in combination with the `tidyverse` packages) --- ## Interlude: Exclude cases with missing values If you only want to keep complete cases in your dataframe, you can easily do so with the function `drop_na()` from the `tidyr` package. You can use the function to remove cases that have missings on any variable or only on specific variables. ```r nrow(gpc) ``` ``` ## [1] 3765 ``` ```r gpc_complete <- gpc %>% drop_na() nrow(gpc_complete) ``` ``` ## [1] 0 ``` ```r gpc %>% drop_na(choice_of_party) %>% nrow() ``` ``` ## [1] 3575 ``` --- ## Recode `NA` into something else For replacing `NA` with another value for a whole dataframe or specific variables you can use the `replace_na()` function from the `tidyr` package (in combination with `mutate()`). ```r gpc <- gpc %>% mutate(hzcy006a = replace_na(hzcy006a, -99)) ``` *Note*: If you want to replace `NA` with other values (via `replace_na()`) or vice versa (via `na_if()`) you can also select several variables with the usual selection helpers, such as `starts_with()` or `where()`. Of course, the particular example above does not make much sense. You can, however, specify different values for different types of missing values. To do this, you probably need to make the recoding dependent on other variables, which is what we will discuss next. --- ## Simple conditional transformation The simplest version of a conditional variable transformation is using an `ifelse()` statement. ```r gpc <- gpc %>% mutate(gender = ifelse(sex == 1, "male", "female")) # please excuse the synonymous use of sex & gender here gpc %>% select(sex, gender) %>% sample_n(5) # randomly sample 5 cases from the df ``` ``` ## # A tibble: 5 x 2 ## sex gender ## <dbl> <chr> ## 1 2 female ## 2 1 male ## 3 2 female ## 4 1 male ## 5 1 male ``` .small[ *Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/). ] --- ## Advanced conditional transformation For more flexible (or complex) conditional transformations, the `case_when()` function from `dyplyr` is a powerful tool. ```r gpc <- gpc %>% mutate(pol_leaning_cat = case_when( between(political_orientation, 0, 3) ~ "left", between(political_orientation, 4, 7) ~ "center", political_orientation > 7 ~ "right" )) gpc %>% select(political_orientation, pol_leaning_cat) %>% sample_n(5) ``` ``` ## # A tibble: 5 x 2 ## political_orientation pol_leaning_cat ## <dbl> <chr> ## 1 0 left ## 2 2 left ## 3 8 right ## 4 3 left ## 5 7 center ``` --- ## `dplyr::case_when()` A few things to note about `case_when()`: - you can have multiple conditions per value - conditions are evaluated consecutively - when none of the specified conditions are met for an observation, by default, the new variable will have a missing value `NA` for that case - if you want some other value in the new variables when the specified conditions are not met, you need to add `TRUE ~ value` as the last argument of the `case_when()` call - to explore the full range of options for `case_when()` check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html) or run `?case_when()` in `R`/*RStudio* --- ## `dplyr::case_when()` <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\dplyr_case_when.png" width="95%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> --- ## Aggregate variables Something we might want to do for our analyses is to create aggregate variables, such as sum or mean scores for a set of items. As `dplyr` operations are applied to columns, whereas such aggregations relate to rows (i.e., respondents), we need to make use of the function `rowwise()`. Say, for example, we want to compute a sum score for all measures that respondents have reported to engage in to prevent an infection with or the spread of the Corona virus. ```r gpc <- gpc %>% * rowwise() %>% mutate(sum_measures = sum(c_across(hzcy006a:hzcy016a), na.rm = TRUE)) %>% ungroup() ``` --- ## Aggregate variables ```r gpc <- gpc %>% rowwise() %>% * mutate(sum_measures = sum(c_across(hzcy006a:hzcy016a), na.rm = TRUE)) %>% * ungroup() ``` Three things to note here: 1. `c_across()` is a special version of `across()`for rowwise operations. 2. We use the `ungroup()` function at the end to ensure that `dplyr` verbs will operate the default way when we further work with the `gpc` object. We do not cover grouping in this course (which is especially valuable for summarizing data), but you can check out the [documentation for `group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) to learn more about this. 3. If you only need sums or means, a somewhat faster alternative is using the base `R` functions `rowSums()` and `rowMeans()` in combination with `mutate()` (and possibly also `across()` plus selection helpers). For an explanation why this can be faster, you can read the [online documentation for `rowwise()`](https://dplyr.tidyverse.org/articles/rowwise.html). --- ## Aggregate variables ```r gpc %>% select(hzcy006a:hzcy016a, sum_measures) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 12 ## $ hzcy006a <dbl> 1, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1, NA, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, NA, 1, 1, NA, 0, 1, 1, 0, NA, NA~ ## $ hzcy007a <dbl> 1, 1, 1, NA, 0, 1, 1, NA, NA, 0, 1, NA, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, NA, 1, 1, NA, 0, 1, 1, 1, NA, NA~ ## $ hzcy008a <dbl> 0, 0, 1, NA, 0, 0, 1, NA, NA, 0, 1, NA, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, NA, 0, 1, NA, 0, 1, 1, 0, NA, NA~ ## $ hzcy009a <dbl> 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 0, 0, 0, NA, NA~ ## $ hzcy010a <dbl> 0, 1, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 1, 0, 0, NA, NA~ ## $ hzcy011a <dbl> 1, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, NA, 1, 1, NA, 0, 1, 1, 1, NA, NA~ ## $ hzcy012a <dbl> 0, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1, NA, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, NA, 0, 1, NA, 1, 0, 1, 0, NA, NA~ ## $ hzcy013a <dbl> 1, 0, 0, NA, 0, 0, 0, NA, NA, 1, 1, NA, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, NA, 0, 1, NA, 0, 0, 0, 0, NA, NA~ ## $ hzcy014a <dbl> 1, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, NA, 1, 1, NA, 0, 1, 1, 0, NA, NA~ ## $ hzcy015a <dbl> 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 0, 0, 0, NA, NA~ ## $ hzcy016a <dbl> 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 0, 0, 0, NA, NA~ ## $ sum_measures <dbl> 5, 6, 6, 0, 4, 5, 6, 0, 0, 5, 7, 0, 3, 5, 5, 5, 6, 7, 3, 5, 4, 4, 6, 3, 5, 0, 4, 7, 0, 1, 6, 6, 2, 0, 0, 5, 5, ~ ``` --- ## Aggregate variables Rowwise transformations work the same way for means. Here, we create a mean score for the items that ask how much people trust specific people or institutions in dealing with the Corona virus. ```r gpc <- gpc %>% rowwise() %>% mutate(mean_trust = mean(c_across(hzcy044a:hzcy052a), na.rm = TRUE)) %>% ungroup() ``` --- ## Aggregate variables ```r gpc %>% select(hzcy044a:hzcy052a, mean_trust) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 10 ## $ hzcy044a <dbl> 2, 4, 4, NA, 5, 4, 4, NA, NA, 4, 5, NA, 3, 5, 5, 5, 4, 3, 5, 3, 5, NA, 4, 4, 5, NA, 4, 5, NA, NA, 5, 5, 4, NA, NA~ ## $ hzcy045a <dbl> 4, 5, 4, NA, 4, 4, 3, NA, NA, 4, 5, NA, 3, 4, 4, 4, 4, 3, 3, 3, 4, 3, 4, 5, 4, NA, 4, 3, NA, NA, 5, 4, 4, NA, NA,~ ## $ hzcy046a <dbl> 4, 5, 4, NA, 4, 4, 3, NA, NA, 1, 4, NA, 3, 3, 3, 3, 2, 3, 4, 3, 3, 3, 2, 4, 4, NA, 4, 3, NA, 2, 4, 4, 3, NA, NA, ~ ## $ hzcy047a <dbl> 5, 5, 4, NA, 4, 5, 4, NA, NA, 5, 5, NA, 5, 4, 5, 5, 4, 5, 4, 4, 5, 4, 5, 4, 4, NA, 5, 4, NA, 2, 5, 5, 2, NA, NA, ~ ## $ hzcy048a <dbl> 5, 5, 4, NA, 4, 4, 4, NA, NA, 4, 4, NA, 3, 4, 4, 4, 2, 2, 3, 2, 4, 3, 4, 4, 4, NA, 4, 4, NA, 1, 5, 5, 2, NA, NA, ~ ## $ hzcy049a <dbl> 5, 5, 4, NA, 4, 4, 4, NA, NA, 4, 4, NA, 3, 4, 2, 4, 1, 2, 2, 2, 4, 2, 4, 4, 4, NA, 5, 4, NA, 1, 5, 5, 2, NA, NA, ~ ## $ hzcy050a <dbl> 5, 5, 4, NA, 4, 4, 4, NA, NA, 5, 5, NA, 4, 4, 5, 4, 3, 4, 3, 3, 4, 3, 4, 4, 4, NA, 5, 4, NA, 1, 5, 5, 4, NA, NA, ~ ## $ hzcy051a <dbl> 5, 5, 4, NA, 4, 4, 4, NA, NA, 5, 5, NA, 5, 4, 5, 3, 1, 5, 4, 3, 5, 3, 5, 4, 4, NA, 4, 4, NA, 1, 5, 5, 4, NA, NA, ~ ## $ hzcy052a <dbl> 5, 5, 4, NA, 5, NA, 4, NA, NA, 2, 4, NA, 5, 4, 5, 4, 2, 5, 2, 3, 5, 2, 5, 4, 5, NA, 4, 4, NA, 3, 5, 5, NA, NA, NA~ ## $ mean_trust <dbl> 4.444444, 4.888889, 4.000000, NaN, 4.222222, 4.125000, 3.777778, NaN, NaN, 3.777778, 4.555556, NaN, 3.777778, 4.0~ ``` --- ## Other variable types In the examples in this session, we only worked with numeric variables.<sup>1</sup> There are specialized `tidyverse` packages offering advanced wrangling options for other types of variables: - [`forcats`](https://forcats.tidyverse.org/) for factors - [`stringr`](https://stringr.tidyverse.org/) for strings/character variables - [`lubridate`](https://lubridate.tidyverse.org/) for time and dates .footnote[ [1] Although we also created two string/character variables. ] <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\dplyr_stringr_lubridate.png" width="60%" style="display: block; margin: auto;" /> --- class: center, middle # [Exercise](https://jobreu.github.io/tidyverse-workshop-esra-2021/exercises/Exercise_4.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/tidyverse-workshop-esra-2021/solutions/Exercise_4.html)